Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Word Stemming
-
Use an existing stemming method, such as NLTK's PorterStemmer (nltk.stem.PorterStemmer).
-
Expand contractions and abbreviations: didn't -> did not, there's -> there is, etc.; Mr. -> Mister, Mrs. -> ..., Ms. -> ...
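The expansion step above could be sketched roughly as follows. This is only a minimal illustration, not a complete normalizer: the lookup tables hold just the examples from these notes, the names CONTRACTIONS, ABBREVIATIONS and expand are made up for this sketch, and curly apostrophes are not handled.

```python
import re

# Illustrative lookup tables; only the examples given in the notes are filled in.
CONTRACTIONS = {"didn't": "did not", "there's": "there is"}
ABBREVIATIONS = {"Mr.": "Mister"}

# Contractions are matched as whole words, ignoring case.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)
# Abbreviations end in ".", so \b cannot close the match; only guard the start.
_ABBREVIATION_RE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")"
)


def _expand_contraction(match):
    """Look up the long form and keep a leading capital letter if present."""
    word = match.group(0)
    expanded = CONTRACTIONS[word.lower()]
    if word[0].isupper():
        expanded = expanded[0].upper() + expanded[1:]
    return expanded


def expand(text):
    """Replace known contractions and abbreviations with their long forms."""
    text = _CONTRACTION_RE.sub(_expand_contraction, text)
    return _ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group(0)], text)


print(expand("Mr. Smith didn't say there's a problem."))
```

The expanded text can then be split into words and passed to a stemmer, for example nltk.stem.PorterStemmer().stem(word).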
Other things
-
It seems hard to get useful information using 1-grams.
-
URLs in text are often important and are relatively easy to extract.
-
After handling URLs, you can replace "/" and "." with spaces so that strings joined by them are not mistaken for real long words.
-
Long words often contain useful information. However, be careful with words of the form "and/or", etc., and do not confuse them with URLs.
-
Keeping only the upper/lower quantile (e.g., 5%) of long words, 2-grams, etc. is a very good idea.
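The points in this list can be strung together as a rough pipeline: pull out URLs first, then replace "/" and "." with spaces, then collect long words and 2-grams. The sketch below makes several assumptions that are not in the notes: the URL pattern is a deliberately simple one (http/https only), "long" means at least 10 characters, and the quantile is taken over frequency counts, keeping the most frequent 5% of distinct items. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Simplified URL pattern (assumption): http/https followed by non-space characters.
URL_RE = re.compile(r"https?://\S+")


def extract_urls(text):
    """Return (urls, remainder), where remainder is the text with URLs removed."""
    urls = [u.rstrip(".,;:!?)") for u in URL_RE.findall(text)]
    remainder = URL_RE.sub(" ", text)
    return urls, remainder


def split_words(remainder):
    """Replace "/" and "." with spaces so that "and/or" becomes two words."""
    return re.sub(r"[/.]", " ", remainder).split()


def long_words(words, min_length=10):
    """Keep words with at least min_length characters (threshold is an assumption)."""
    return [w for w in words if len(w) >= min_length]


def bigrams(words):
    """Return adjacent word pairs (2-grams)."""
    return list(zip(words, words[1:]))


def top_fraction(counts, fraction=0.05):
    """Keep the most frequent fraction of distinct items, at least one item."""
    if not counts:
        return []
    k = max(1, math.ceil(len(counts) * fraction))
    return [item for item, _ in counts.most_common(k)]


text = "Read https://example.com/docs/intro.html and/or the internationalization notes."
urls, remainder = extract_urls(text)
words = split_words(remainder)
print(urls)
print(long_words(words))
print(top_fraction(Counter(bigrams(words))))
```

Because URLs are removed before the separators are replaced, "and/or" is split into two short words, while the URL is kept whole as its own feature. Other punctuation, such as commas, is left attached to words here; the lower quantile would be taken from the other end of the same frequency ordering.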